Chemical Composition as Predictor of Wine Quality

by Michael Jenkins

Introduction

When shopping for a bottle of wine, it can be difficult to choose due to all
of the available options. These datasets about wine could offer some insight
to make those decisions easier. I have a preference for red wine. So that’s
where I’ll begin.

They contain a sampling of wine chemical composition and a quality rating.
Quality is determined by an average of expert opinions. Compositional
qualities are provided in a variety of measurements across 10 variables. In
total, there are 6,497 observations with 1,599 red and 4,898 whites.

These datasets should allow a good exploration of the relationship between
wine quality and chemical composition.

The datasets explored in this analysis were found
here for red and
here for white. The accompanying documentation of
the datasets can be found here.

Citations

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Initial Investigation

First, a quick look at the data structure.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There’s not much to go on with just raw numbers. We’ll need to investigate the
accompanying documentation to give context to these numbers. From the
documentation we find the data set contains compositional measurements of
various solutes found within a solution of wine. Also included is a measure of
quality saved as an integer.

This investigation is primarily concerned with predicting wine quality from
wine composition. It only makes sense to start with an investigation of the
quality score. How utilized is the 0-10 quality scale?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

With a minimum of three and a maximum of eight, it’s fair to say the range is
underutilized. With a median near the mean, there’s no reason to suspect
skew. What does give me pause is that the 1st and 3rd quartiles are separated
by only 1. This suggests an inordinate number of wines in the middling quality
range. Histograms are cheap and will give a quick visualization to confirm.

It appears the dataset does primarily consist of wines of average quality.
A pie chart would provide an understanding of how homogeneous the population
quality is.

Wines of average quality definitely account for a large portion of our sample.
Without performing a calculation, I estimate about 80% of the red wine sample
is of average quality.
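
The estimate is easy to check with a short calculation. The sketch below is in Python rather than the R used for this analysis, and `share_of_average` is a hypothetical helper name. It assumes that average means a quality of 5 or 6. The sample holds only the first ten red wine quality scores from the structure output above, so its result is not the figure for the full dataset.

```python
def share_of_average(qualities, average_scores=(5, 6)):
    """Return the fraction of wines whose quality is one of the average scores."""
    if not qualities:
        raise ValueError("qualities must not be empty")
    hits = sum(1 for q in qualities if q in average_scores)
    return hits / len(qualities)


# First ten red wine quality scores from the structure output (not the full sample).
first_ten_red = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
print(f"{share_of_average(first_ten_red):.0%} of these ten wines are rated 5 or 6")
```

Running the same function over the full quality column would replace the visual estimate with an exact share.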

It might be valuable to include the sample of white wines to increase the
sample size.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The datasets share a source. It makes sense they would have compatible
variables. However, the white wine dataset does contain about three times as
many observations as the red (4,898 versus 1,599).

Remember, the primary purpose of adding the white wine dataset to the
investigation is to increase the granularity of quality ratings. There is the
possibility that including white wine will skew our data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Not much is different. The median and the 1st and 3rd quartiles are unchanged,
and the mean rises only slightly, from 5.636 to 5.878. On a successful note,
the quality range is better utilized. We now have a population of wines rated
as nine.

Perhaps the distribution of quality for white wines has greater utilization.

That’s disappointing. There is still a disproportionate number of average
quality wines in the white sample.

That disproportion is more apparent in a pie chart. Still, a smaller percentage
of white wines are considered average quality. It seems approximately 75% of
the sample is rated at five and six, compared to 80% of the reds. So there
was some success in our attempt at gaining granularity.

What Next

There may not be enough of a sample size to predict if a wine is of
exceptionally good or poor quality based on its composition alone. There do
seem to be enough samples to determine the chemical make-up of an average
quality wine.

Perhaps it will be enough to predict if a wine is of average quality. If it
isn’t, then it’s either very bad or very good. With this thinking, analysis
of average vs. not-average may provide the best insight this dataset can
provide. We might be able to say what makes a wine average.

Delving into the Dataset

Some rudimentary research on wine turned up
Wine Chemistry on Wikipedia.
From this we see there’s a possibility of some missing measures. For example,
phenolic compounds and proteins. The possibility of missing solute data is
a cause for concern. However, the accompanying documentation makes the claim
that no compositional information is missing. We’ll proceed with the
information we do have.

The datasets contain thirteen variables for each observation. Ten of those
variables describe composition. Eight of the ten are measures of
concentration. Alcohol content is provided as a percent. Density provides a
measure of the entire solution. Some of the solute measures are the same.
Some are of different units. These should be converted to equal units of
measure. With equal units, we can make better compositional comparisons.

The remaining variables are not compositional measurements. X is an index that
is not linked to another dataset. It can be discarded. Quality is provided as
an integer. There may be cause to convert it to a factor. Converting quality
to a factor would reduce the scale. This may complicate any future additions
to the dataset. Finally, there’s pH. pH is a descriptor of acidity. It may be
of interest to explore the relationship between total acidic composition and
pH.

Points of Interest

Of most interest are the measures of concentration. The mass per volume of
each solute is assumed to have a correlative relationship to the wine’s
quality. Comparing the ratios of a wine’s composition against the wine’s
quality is the most interesting feature to investigate.

Another interest is the non-compositional value of pH. There are a few
questions we can try to answer here. What commonality between pH values exists
at varied concentrations of acidic solute? Is there a relationship between pH
and quality?

With both red and white datasets, we can also investigate differences between
the two types. How do their compositions differ? Do those compositional
differences affect the qualitative scale between whites and reds?

Data Preparation

As previously noted, we will need to convert our compositional data to units
of equal measure. First we should consider what options are available and
which best fits our purpose.

The first option is mass per volume. The data is presented to us in this
format. This could be a good measure for the investigation. Grams per
cubic decimeter may be a common measure in the drug markets. I’d like a unit
of measure that is more widely understood.

Parts per million (ppm) is a commonly reported unit of concentration. This
would make for easily reported values. What effect would a scale in the
millions have on visualizations?

A solute’s percentage of a solution should cover these concerns. A summation
of solute percent can be used to determine the ratio of solvent in each
solution of wine. Percents should provide a more manageable scale, and are
commonly understood.

If any of these assumptions about percents proves false, it will be easy to
convert back to another better suited measure. 1% is equal to 10,000 ppm and 1
ppm is equal to 1 mg/L. We can also easily convert to grams per serving.
There are 33.8 ounces in a liter and 5 ounces in a serving.
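
These conversions can be written down as a small sketch. It is in Python rather than the R used for this analysis, and the function names are hypothetical. It relies on the approximations stated above: 1% is 10,000 ppm, 1 ppm is treated as 1 mg/L, a liter holds 33.8 ounces, and a serving is 5 ounces.

```python
OUNCES_PER_LITER = 33.8    # ounces per liter, as stated in the text
OUNCES_PER_SERVING = 5.0   # serving size stated in the text


def percent_to_ppm(percent):
    """Convert a percent concentration to parts per million."""
    return percent * 10_000


def percent_to_grams_per_liter(percent):
    """Convert percent to g/L, assuming 1 ppm is equivalent to 1 mg/L."""
    return percent_to_ppm(percent) / 1000.0


def percent_to_grams_per_serving(percent):
    """Convert percent to grams in one 5 ounce serving."""
    liters_per_serving = OUNCES_PER_SERVING / OUNCES_PER_LITER
    return percent_to_grams_per_liter(percent) * liters_per_serving


for label, convert in [
    ("ppm", percent_to_ppm),
    ("g/L", percent_to_grams_per_liter),
    ("g per serving", percent_to_grams_per_serving),
]:
    print(f"1% = {convert(1.0):.2f} {label}")
```

Treating 1 ppm as 1 mg/L assumes the solution weighs about one kilogram per liter, which is close to the densities in this data.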

Having considered the options, we can generate a working dataset. The
following list should cover what has been decided on.

  1. Add a field to note the wine’s color.
  2. Merge the two datasets into one.
  3. Add a logical field to note if the wine is ordinary. (i.e. quality is 5 or 6)
  4. Convert density from grams per cubic centimeter to grams per cubic
    decimeter. (i.e. multiply by 1000)
  5. Convert sulfur dioxides from milligrams per cubic decimeter to grams per
    cubic decimeter. (i.e. divide by 1000)
  6. Convert solute concentration values to percents. (i.e. divide their value
    by density and multiply by 100)
  7. Add fields for total solute and solvent percents.
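
The steps above can be sketched as follows. This is an illustration in Python, not the R code behind this report. The `prepare` function and the output field names (`color`, `ordinary`, `total.solute`, `total.solvent`) are hypothetical. The two rows are the first red and first white observations from the structure output, with density as rounded there. Which columns go into the solute total is an assumption: total sulfur dioxide is used instead of free sulfur dioxide to avoid double counting, and alcohol is counted as a solute.

```python
SOLUTES_G_PER_DM3 = [
    "fixed.acidity", "volatile.acidity", "citric.acid",
    "residual.sugar", "chlorides", "sulphates",
]
SULFUR_MG_PER_DM3 = ["free.sulfur.dioxide", "total.sulfur.dioxide"]


def prepare(row, color):
    """Apply the preparation steps to one observation and return a new dict."""
    out = dict(row)
    out["color"] = color                                  # step 1
    out["ordinary"] = out["quality"] in (5, 6)            # step 3
    out["density"] = out["density"] * 1000                # step 4: g/cm^3 to g/dm^3
    for name in SULFUR_MG_PER_DM3:                        # step 5: mg/dm^3 to g/dm^3
        out[name] = out[name] / 1000
    for name in SOLUTES_G_PER_DM3 + SULFUR_MG_PER_DM3:    # step 6: percent of solution
        out[name] = out[name] / out["density"] * 100
    # Step 7 (assumption): sum total sulfur dioxide rather than free, add alcohol.
    summed = SOLUTES_G_PER_DM3 + ["total.sulfur.dioxide", "alcohol"]
    out["total.solute"] = sum(out[name] for name in summed)
    out["total.solvent"] = 100 - out["total.solute"]
    return out


first_red = {
    "fixed.acidity": 7.4, "volatile.acidity": 0.7, "citric.acid": 0.0,
    "residual.sugar": 1.9, "chlorides": 0.076, "free.sulfur.dioxide": 11.0,
    "total.sulfur.dioxide": 34.0, "density": 0.998, "pH": 3.51,
    "sulphates": 0.56, "alcohol": 9.4, "quality": 5,
}
first_white = {
    "fixed.acidity": 7.0, "volatile.acidity": 0.27, "citric.acid": 0.36,
    "residual.sugar": 20.7, "chlorides": 0.045, "free.sulfur.dioxide": 45.0,
    "total.sulfur.dioxide": 170.0, "density": 1.001, "pH": 3.0,
    "sulphates": 0.45, "alcohol": 8.8, "quality": 6,
}

# Step 2: the merged dataset is one list holding rows of both colors.
wines = [prepare(first_red, "red"), prepare(first_white, "white")]
for wine in wines:
    print(wine["color"], round(wine["residual.sugar"], 3), round(wine["total.solvent"], 2))
```

Note that alcohol is reported as a percent by volume while the other solutes become percents by mass, so the totals are approximate.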

Data Preparation Aftermath

Having cleaned up the data, I’d like to take note of the variables added. It
is good to keep track of such things. Just in case we forget why they were
added.

The sum of the percent solutes does not account for the total solution. With
no information on the composition of the remaining solution, and the claim
there are no missing attributes, we must assume the unaccounted solution is a
tasteless solvent (e.g. water). We’re also making the assumption that alcohol
is not a solvent for this case. Higher solvent content may be a factor in
determining quality. We could hypothesize that watered-down wine is of lower
quality. These values will allow us to test that hypothesis.

Another variable we added retains information on which dataset the observation
originated from. This will allow us to compare the differences between red and
white wines. As someone with a preference for reds, I think this part is the
most important.

The exceptional vs ordinary variable is intended for exploring what is common
in wines of average quality. In this case, exceptional is defined as not
average. Exceptionally bad and exceptionally good wines have similar numbers
of observations. Comparing what makes for average vs. non-average wine and
then comparing both ends of the extremes may give a more complete picture of
wine quality.

Finally, the solute columns were changed from mass per volume to percents.
Density, by definition, is 100% of the solution, which is why it is the
divisor in our calculation. We should also not forget the unit conversions we
have. While we are working with percents, it helps to be able to report
whichever value is most conducive.

1% is equal to:

  • 10,000 ppm
  • 10 g/L
  • 1.48 g per serving

Deeper Investigation

When cleaning up the dataset, a number of questions were raised on the
relationships between variables. A matrix synopsis between the solutes should
give a good direction on where to go.

Universal Comparisons

From this initial comparison, it seems alcohol content is the best predictor
of wine quality. It also suggests the idea of average versus other may not
provide the results hoped for. Based on the distribution of alcohol content by
quality, there’s more sense in splitting quality into less than average and
better than average values. This would regroup the observations with a split
between qualities 5 and 6.
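
A minimal sketch of that regrouping, in Python rather than the R used for this analysis. The `quality_group` name and the "lesser" and "better" labels are hypothetical. The sample is only the first ten red wine quality scores from the structure output, so the counts say nothing about the full dataset.

```python
def quality_group(quality):
    """Label a wine 'lesser' (quality 5 or below) or 'better' (quality 6 or above)."""
    return "better" if quality >= 6 else "lesser"


# First ten red wine quality scores from the structure output (not the full sample).
first_ten_red = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
groups = [quality_group(q) for q in first_ten_red]
counts = {label: groups.count(label) for label in ("lesser", "better")}
print(counts)
```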

The strongest solute correlation is between free sulfur dioxide and total
sulfur dioxide. This makes sense as one is a subset of the other. Sulfur
dioxide content shows an interesting phenomenon in relation to quality. It
appears that lower quality wines have a larger range of sulfur dioxide content.

Acidic content also appears to have a slight impact on wine quality. The upper
and lower bounds of acidic solutes narrow as quality improves. Total acidic
content could be investigated further.

Total solvent and color are missing from the comparisons. They should be added
in the next comparison set. Variable ordering should also be adjusted in the
next comparison set.

Refined Comparisons

This new set has a higher number of correlated values. As alcohol content
increases, total solvent decreases. This suggests that alcohol is a greater
portion of the solution than the other solutes. The opposite is true for
total acid and fixed acidity. As one increases, so does the other. This
suggests that fixed acidity has a higher proportion of total acidity than the
other acidic forms.

Residual sugar appears to be consistent through all observations. This is
surprising due to the relationship between sugar as a fuel for yeast to turn
into alcohol.

Despite the low correlation between pH and total acidity, there does appear to
be a loose relationship. As total acidity increases, the pH drops (becomes
more acidic). There also appears to be a slight relationship between acidity
and quality. Fixed acidity, citric acid and pH seem to be consistent through
all quality grades. Volatile acidity seems to have a higher content in lesser
quality wines.

We start to see hints at the variation in chemical makeup between red and
white wines. White wines tend to have more residual sugar, more sulfur
dioxides, and lower pH. The lower pH of white wines contrasts with the lower
acidic content of white wine. Remember, lower pH means more acidic. The
expectation is higher acidic content would have lower pH. One possible
explanation is fixed vs. other forms of acid.

I did note earlier that the range of total sulfur dioxide is greater at lower
qualities. However, the mean of sulfur dioxide content does remain constant.
The difference in range could be attributed to white wine’s tendency to have a
higher concentration of sulfur dioxides than red wines. There’s some logic to
save these values for a future analysis. One with greater focus on the
compositional differences of red and white wines.

The effect of acidic compounds and pH on quality also appears to be dependent
on wine variety. With all of these differences between red and white, it would
be good to compare red and white composition separately.

Comparisons of Varietals

Red

White

When splitting out the comparisons by color, it’s easier to see how
differences in composition affect quality. Acidic and sulfuric compounds are
greater predictors of wine quality in reds than in whites. Across both types,
alcohol remains the greatest predictor of wine quality. Acidic and sulfuric
compounds will be reserved for future analysis.

Is there something that indirectly influences quality? A relationship between
alcohol content and the remaining compounds? What are the ratios of other
compounds compared to alcohol content?

Distribution of chlorides (salt) across alcohol seems to match the curve of
alcohol content for all types. Less alcoholic wine tends to have more salt.
Aside from some outliers, the percent salt content in red and whites appears
to be similar.

Despite the low correlation, residual sugar and alcohol content also appear
to have a universal trend across the varietals. We can only speculate on the
unknown factors that would produce such a result. One factor could be the
starting sugar. Another might be the yeast used to convert sugar to alcohol.
At what alcohol content does that yeast die? Was the fermentation process
interrupted at a certain point? These are questions we don’t have the answers
to. We shouldn’t ignore this apparent relationship due to unknown influences.

From all of this, the plots between alcohol, salt and sugar should be the
center of our exploration. There appears to be some type of relationship
between alcohol and quality. There also appears to be a relationship between
ratios of salt and sugar to alcohol. Before moving forward, it’s only prudent
to take some steps back.

Alcohol and Quality

Distribution of the alcohol content in our samples is right skewed. Are the
samples of higher content skewing the quality contents? So far, we’ve
been directing our thinking on alcohol content and wine quality by the
tendency of higher average alcohol content at higher levels of quality. We
should take a closer view at the data to support this assumption.

There’s quite a bit of overplotting in that graph. Still, there is some
pattern in the extreme qualities.

Wine samples graded as a 9 can be counted on one hand. There’s still a bit of
overplotting. It is easier to see the clusters of samples rise as quality
improves.

Reducing the alpha a bit makes the change in the clusters clearer.
It’s also clear we can’t say that wine quality will be better just because the
alcohol content is higher. This is different from saying higher quality wines
tend to have higher alcohol content. Do not confuse this with saying lower
quality wines will always have lower alcohol content.

Taking the bins we’ve defined for lesser and better quality shows a clearer
picture. Better quality wines seem evenly dispersed across the range of
alcohol content. It looks like the cutoff is just above 10% alcohol
content. Wines with more than 10% alcohol content are more likely to be of
better quality.

Considering we would rather increase than give up resolution of quality,
it would be good to take a look at the upper and lower quality bins from
another perspective. To facilitate this, let’s recreate the smaller bins
without the added field.

We ended up dropping wines of quality 9. As there are so few observations with
that rating, they could be considered outliers. Basically this is the same
graph as previously created. The benefit of this method is that it allows us
to maintain quality granularity within wines of better or lesser quality.

I’m also having some regret about the decision to transform quality into a
factor. That does give more control over the scale. That additional control is
underutilized when quality is on an integer scale.

Taking a closer look at the breakdown of the better quality wines, we see that
wines of quality 6 run the full spectrum of alcohol content. Disregarding
wines of quality 6, better wines start to occur more frequently at alcohol
content near 10-11%. Wines of the highest quality have even fewer occurrences
with alcohol content under ~9.5%.

The shift for wines of greater vs lesser quality around 10% is more apparent
by limiting the view to the middle quality scores. Wines of quality 6 appear
to be a merge between wines of quality 5 and 7.

Breaking out the lowest quality wines shows a dispersion that rarely breaks
11.5% alcohol. If the lowest quality wines rarely rise above 11.5% alcohol and
the highest quality wines rarely have less than 9.5% alcohol, then the
average quality wines must fall within that range.

What would happen to our graphs if we eliminated the ordinary wines defined as
having alcohol content between 9.5% and 11.5%? This also revisits the idea of
average versus non-average wines. We have a definition of what makes for an
ordinary wine. Eliminating those from our views, it should become easier to
see the differences at the extremes.

The cut off between better and lesser quality wines is even easier to see when
we remove the middle range of alcohol content. This split more clearly occurs
at some granularity of quality rated 6.

It’s now clear that quality 6 wines are ambiguous in quality. We should look
back at the lesser vs better quality graph while eliminating the ambiguous
value.

There is now a quite clear relationship between quality and alcohol content.
We should remember the quality scale is subjective and based on the opinion of
experts. It’s possible those experts have a preference for higher alcohol
content in their wines. Based on the observations made thus far, I would
suggest that preference is likely.

As part of due diligence, we should take a look at the actual correlation
value between quality and alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$alcohol and as.numeric(wines$quality)
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4245892 0.4636261
## sample estimates:
##       cor 
## 0.4443185

Across all samples, there is a moderate correlation between quality and
alcohol content. This correlation includes the ambiguous quality rating. This
is a central value that fits the model but skews the data. This skew appears
to be because of a lack of resolution in the quality scale. What happens to
correlation if we omit this ambiguous score?

## 
##  Pearson's product-moment correlation
## 
## data:  wines.noMiddle$alcohol and as.numeric(wines.noMiddle$quality)
## t = 41.294, df = 3659, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5413057 0.5855142
## sample estimates:
##       cor 
## 0.5638137
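
As a cross-check on the two test outputs above, the t statistic can be recomputed from the reported correlation and degrees of freedom using t = r * sqrt(df / (1 - r^2)). The sketch below is in Python rather than the R used for this analysis, and the function names `pearson_r` and `t_statistic` are hypothetical. It should land close to the reported values of 39.97 and 41.294.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, df):
    """t statistic for testing whether a correlation r differs from zero."""
    return r * math.sqrt(df / (1 - r * r))


# Values copied from the two test outputs (all wines, then quality 6 omitted).
print(round(t_statistic(0.4443185, 6495), 2))
print(round(t_statistic(0.5638137, 3659), 3))
```

Given the alcohol and quality columns as two lists, `pearson_r` would reproduce the correlation estimate itself.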

By omitting the ambiguous quality value on our scale, the correlation rises
from 0.44 to 0.56. I would hypothesize that, if we had a more granular and
less subjective quality scale, the correlation between the two would be even
stronger. For now, I will state that alcohol is an acceptable measure of
quality. It lends greater granularity than the current quality scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The earlier observations on the lower limit of alcohol content for higher
quality wines (~9.5%) and the upper limit for lower quality wines (~11.5%)
fall almost exactly on the 1st and 3rd quartiles of the alcohol content range
(9.50 and 11.30). This gives even more confidence in the assertion that
alcohol is a measure of quality.

Intermediate Analysis

The quality score provided by the experts lacks in granularity. Luckily, the
experts tend to rate wines of higher alcohol content more favorably. By
equating alcohol content with quality, we have better granularity for
determining the composition of higher quality wines.

On the compositional side of things, salt and sugar seem to have the most
consistent relationship to alcohol across both varietals. There are compounds,
such as citric acid, with stronger relationships to alcohol content. Those
relations do not appear to be nearly as consistent across varieties.

Moving forward in our exploration, we’ll focus on alcohol, salt, sugar, and
wine variety. The other interesting variables will be saved for future
analysis.

Deeper Look

To complete our investigation, we’ll start by investigating how salt
concentration compares at various alcohol levels.

Salt

The majority of wines have less than a hundredth of a percent of salt. There’s
a jump in salt content around 9.5 percent alcohol. Most of them appear to be
outliers. Some scaling and subsetting should produce a better picture.

There’s a consistent, if tiny, drop in salt content from ~0.0047% to ~0.0035%
as alcohol content increases from ~9% to ~13%. These values extend just
outside the ~9.5% to ~11.5% range we previously determined as average quality.
As wine quality increases, salt content decreases.

What about sugar content?

Sugar

Again, the data is condensed on itself. Some adjustment on perspective can
only help.

There’s a large amount of residual sugars in wines of low alcohol content.
This seems logical. Sugar is what yeast eats to make alcohol. Less sugar
consumed means more left over sugar and less alcohol.

There appears to be an interesting bump in sugar content near the delineation
of average and better than average wines. Of all the possibilities that could
explain that bump, the one we have is variety.

That doesn’t answer anything about the bump. If anything, the bump is more
pronounced. The interesting thing it did illustrate are the differences in
red and white wines. It seems there’s two graphs of sugar content stacked on
each other.

The lower graph is composed of the red wines, which show a consistent level
of residual sugar content at all levels of alcohol. In the upper graph of
white wines, sugar drops sharply as alcohol increases before leveling out at
the highest levels.

Salt Revisited

Is there a similar phenomenon in salt content?

White wines have less salt than reds. White wines also have a steeper drop in
salt content as alcohol content increases compared to reds. It is a similar
phenomenon, except that red wines are the ones with more salt.

Sugar in Reds

Returning to sugar content, it might be helpful to break reds and whites out
to their individual scales.

Unlike white wines, red wines appear to have a slight increase in sugar
content as alcohol increases. It’s very slight, starting around 0.225% and
ending near 0.25%. Given the low sample size and the lack of information on
measurement error, the best we can say is red wine has consistent sugar
content across all values. There may be some minuscule differences, but
there’s not enough information to say sugar content is a contributing factor
in red wine quality.

To be sure, we should check against the original quality scale.

The majority of values reside under 0.4% sugar content. Adjusting both graphs
might yield a better image. Before making this adjustment, it’s worth noting
the 1st quartile and mean are fairly consistent across the quality range.
There is some variation in the 3rd quartile, but no consistent pattern.

Removing these outliers produced a more consistent pattern in both graphs. The
sugar content for the range of alcohol content is more flat.
Sugar content at each quality score has a consistent mean with variation
occurring in the 1st and 3rd quartiles.

As an aside, the similarities in these graphs reinforce our earlier assertion
that alcohol content is a good predictor of wine quality. Unfortunately, these
findings are contrary to our earlier observation that sugar content related
to alcohol has a consistent relationship across varietals.

Putting it All Together

One final graph I want to develop is alcohol content at ratios of salts and
sugar.

What we see is a concentration of higher alcohol content at lower salt and
sugar content. Some of that could be attributed to the intersection of red and
white wines.

Red wine tends to have a much lower sugar content and higher salt content than
white wine. That could explain the lower leg of this graph.

Another thing to consider is the range of granularity alcohol content
provides. This spread of alcohol content across the ratios of sugar to salt
could become more apparent by introducing more colors to our graph.

At three colors, it seems more clear that lower alcohol content wines tend to
occur at the extreme values of sugar and salt content.

At four colors, diminishing returns start to kick in. It does help solidify
what we’ve already seen. There’s slightly more resolution on the ratios of
sugar and salt where the highest alcohol content wines occur. The highest
concentration of red dots appears near 0.0025% salt and 0.25% sugar.

It’s worth noting that what has been referred to thus far as a ratio between
two compounds is a misnomer. A true ratio would describe, for example, 100
grams of sugar for every gram of salt. We do see a higher concentration of the
higher alcohol content wines around 0.3% sugar and 0.003% salt. We do not see
any examples at 2% sugar and 0.02% salt. There is no diagonal line that would
describe a perfect ratio.

To be more accurate, we should say that the highest alcohol content wines
occur between 0.003% and 0.004% salt content and 0.1 to 0.75% sugar content.
Looking closer at this range could provide additional value.
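
The difference between a ratio and a range can be made concrete with a short sketch. It is in Python rather than the R used for this analysis, and both function names are hypothetical. The bounds are the approximate values read off the graph above, and the 5% tolerance on the ratio check is an arbitrary assumption for illustration.

```python
import math

SALT_RANGE = (0.003, 0.004)   # percent salt, approximate bounds from the text
SUGAR_RANGE = (0.1, 0.75)     # percent sugar, approximate bounds from the text


def in_high_alcohol_range(salt_pct, sugar_pct):
    """True if both salt and sugar fall inside the ranges read off the graph."""
    return (SALT_RANGE[0] <= salt_pct <= SALT_RANGE[1]
            and SUGAR_RANGE[0] <= sugar_pct <= SUGAR_RANGE[1])


def on_fixed_ratio(salt_pct, sugar_pct, ratio=100.0, tolerance=0.05):
    """True if sugar is within the relative tolerance of ratio times salt."""
    return math.isclose(sugar_pct, ratio * salt_pct, rel_tol=tolerance)


# The two points discussed in the text: both sit on a 100:1 ratio,
# but only the first lies inside the range where high alcohol wines occur.
for salt, sugar in [(0.003, 0.3), (0.02, 2.0)]:
    print(salt, sugar, on_fixed_ratio(salt, sugar), in_high_alcohol_range(salt, sugar))
```

A point can satisfy the ratio and still fall outside the range, which is why the range is the better description.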

The Happy Spot

Zooming in on this range we can see that it isn’t a true ratio. It’s more of a
happy spot where the balance between salt and sugar isn’t too much in one
direction or the other. There’s a bit of noise at this scale with 3 colors.

Gradation of one color from black to white provides a good contrast. We can
better see where there’s not enough sugar for the amount of salt in that
bright blue. Just above the bright blue, we see a distribution of higher
alcohol content wines extending into higher sugar content.

Zooming back out with this color scale, the happy spot is more pronounced.
The trade off is a loss of pronouncement in the extreme values.

Red vs. Whites

One thing that we’ve touched on previously is the composition of varietals
and how they differ even where there are similarities.

Red wine’s contribution to the happy spot is clear and not unexpected. Red
wines have a consistent sugar content and typically have more salt. As such
they make up the lower leg of the salt to sugar distribution.

White wine’s contribution to the happy spot is also quite clear, although
white wine seems to be more forgiving on sugar content. Instead of saying
white wines are more accepting of high sugar content, we should say that
higher levels of salt need lower sugar content to remain at high quality.

Multivariate Analysis

Based on what we’ve seen, salt content has more impact on wine quality than
sugar. Even with sugar rich white wines, salt content is the largest factor
for determining quality. In white wines, you can have more sugar in a higher
quality wine as long as the salt content remains in a minimal range.

Between red and white wines, there is a convergence where wines of the highest
alcohol content occur with a balance of sugar and salt. This sweet spot
provides some answers to the question of how wine composition relates to
quality. What is not clear is how alcohol content and the balance of salt and
sugar relate to quality. Are the ratios of salt and sugar the ultimate
determinant of quality, or does the alcohol content also contribute?

Alcohol content as a granular measure of quality is something that shouldn’t
escape scrutiny. In this analysis, alcohol content was taken as a surrogate
for quality. It offered a level of granularity not present in the quality
scale. One issue with this is it may hide the relationship of other solutes in
the solution and how the combination of those solutes relate to quality.

Final Plots and Summary

Quality

Using Alcohol as Quality

The dataset has an issue of granularity within the quality scale. Without
a greater level of granularity, the delineation between good and bad quality
wines is lost. Alcohol content was identified as a surrogate for quality. It
offers a greater level of granularity.

This granularity and the logic of alcohol content as quality are illustrated
here. The exceptionally good and exceptionally bad wines lack the observation
density to produce consistency with this assumption. However, for the
observations we do have, we see the mean alcohol content of lesser quality
wines is much lower than that of higher quality wines.

Salt Content of Varietals

Quality of Salt in Wine

Using alcohol as a granular measurement of quality revealed salt
content as the major factor in determining wine quality. Across both
varietals, as alcohol content increases, salt content decreases. The better
wines have less salt.

The amount of salt in the wine samples is minuscule. Where alcohol can be
measured in full percents by volume, salt is better described in parts per
million. It is interesting that slight variation in a small amount can have
such a great effect on quality.

We also start to see a divergence between the two main varietals of wine. Red
wines tend to have almost twice the amount of salt of white wines. This
suggests that quality is reliant on a combination of compounds.

Alcohol, Salt and Sugar Content

Composition of Varietal Quality

We now start to see the relationship between combinations of compounds and
quality. Also apparent are the differences and similarities between red and
white wines. There’s a convergence between both varietals for ratios of salt
and sugar where the highest alcohol content occurs. This sweet spot for salt
and sugar is represented in the darkened area of the graph.

There’s a range of salt that is acceptable for reds and whites. There’s a
similar range for sugar as well. The difference for varietals is which solute
is more forgiving in quality. More salt is acceptable in red wines as long as
it remains within a limitation of sugar. The inverse is true for white wines.

Reflection

One thing that strikes me about this data is how worthless the quality score
is. Asking someone to rate something on a scale of one to whatever does offer
some insight. The question here is, how many opinions using that scale are
needed to get a usable granularity? It’s a great starting point with limited
use.

Despite the limitations of the quality score, the dataset does offer a depth
of possibilities. Differences of composition between varietals, absent
quality considerations, are at the top of my list. While alcohol content was
chosen as a surrogate for quality, it still is a measure of alcohol content.
It could be that alcohol is a good measure of quality. I would argue that the
polled experts have a bias for alcohol content. What makes them experts and
what criteria do they use in rating decisions?

There are also many items noted in the initial pass of the data that were not
explored further. One group of compounds not explored and deserving a future
analysis is the acids. There’s enough evidence to explore the relationship
between citric acid, sugar and salt. There’s also an expectation that pH is
influenced by acidic compounds. Expectations should always be tested.

At the end of it, the gleaned insight is something I look forward to testing.
Next time I’m in the store for a bottle of wine, I’m going to use alcohol
content as my primary deciding factor. It will be interesting to see how many
bottles I go through before it proves to be a bad metric.